
[Train] Simplify single worker training #19814

Merged: 6 commits, Oct 28, 2021

Conversation

@amogkam (Contributor) commented Oct 28, 2021

Currently, Ray Train does not set up the distributed environment (the torch process group, or the TF_CONFIG environment variable) when only 1 worker is used.

However, this requires users to change their training code when scaling from 1 worker to multiple workers, and it has been a source of confusion in our examples:
#19506
#19761

This PR changes the behavior so that the distributed environment is set up regardless of the number of workers. Training functions that use DistributedDataParallel or MultiWorkerMirroredStrategy therefore work unchanged with single-worker Ray Train. This PR also adds tests for the quick start code examples in the docs.

Closes #19761
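The behavior change can be sketched roughly as follows. This is a hypothetical illustration, not Ray Train's actual internals; the function name `should_setup_distributed` is invented for the sketch:

```python
def should_setup_distributed(num_workers: int, after_pr: bool = True) -> bool:
    """Return whether the distributed environment (e.g. the torch process
    group, or the TF_CONFIG env var) should be initialized.

    Hypothetical sketch of this PR's behavior change, not real Ray Train code.
    """
    if after_pr:
        # New behavior: always set up the distributed environment, so code
        # using DistributedDataParallel or MultiWorkerMirroredStrategy runs
        # unchanged with a single worker.
        return True
    # Old behavior: skip setup for a single worker, forcing users to edit
    # their training function when scaling up.
    return num_workers > 1

print(should_setup_distributed(1, after_pr=False))  # False
print(should_setup_distributed(1, after_pr=True))   # True
```

With the old behavior, a single-worker run would skip process-group setup and a `DistributedDataParallel` wrapper would fail; with the new behavior the same training function works at any worker count.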

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@matthewdeng (Contributor) left a comment


This is awesome. Really like this pattern of running the documentation code in CI.

Can you add instructions and link this as an example in [OSS] How to write an example? This is really a best practice I think we should all be following. (Also, be sure to include the trick of moving the function calls under __main__!)
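The `__main__` trick mentioned above can be sketched like this (a hypothetical docs snippet; `train_example` is an invented stand-in): keeping top-level calls under the guard lets a CI test import the module and call the function directly, without the example running on import.

```python
# quick_start_example.py -- sketch of a docs snippet structured for CI testing.

def train_example() -> int:
    # Stand-in for the real quick start training code.
    return sum(range(10))

if __name__ == "__main__":
    # Only runs when executed as a script, not when imported by a test.
    print(train_example())
```

A test can then do `from quick_start_example import train_example` and assert on its result, so the same code serves as both documentation and a regression test.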

@amogkam amogkam merged commit 1803d88 into ray-project:master Oct 28, 2021
@amogkam amogkam deleted the train-single-worker branch October 28, 2021 17:54
Successfully merging this pull request may close these issues.

[Bug] tensorflow_mnist_example fails with 1 worker